
    Lessons from the CAGI-4 Hopkins clinical panel challenge

    The CAGI-4 Hopkins clinical panel challenge was an attempt to assess state-of-the-art methods for clinical phenotype prediction from DNA sequence. Participants were provided with exonic sequences of 83 genes for 106 patients from the Johns Hopkins DNA Diagnostic Laboratory. Five groups participated in the challenge, predicting both the probability that each patient had each of fourteen possible classes of disease and one or more causal variants. In cases where the Hopkins laboratory reported a variant, at least one predictor correctly identified the disease class in 36 of 43 patients (84%). Even in cases where the Hopkins laboratory did not find a variant, at least one predictor correctly identified the class in 39 of 63 patients (62%). Each prediction group correctly diagnosed at least one patient who was not successfully diagnosed by any other group. We discuss the causal variant predictions by the different groups and their implications for further development of methods to assess variants of unknown significance. Our results suggest that clinically relevant variants may be missed when physicians order small panels targeted on a specific phenotype. We also quantify the false positive rate of DNA-guided analysis in the absence of prior phenotypic indication.

    IgTM: An algorithm to predict transmembrane domains and topology in proteins

    Background: Because they act as receptors or transporters, membrane proteins play a key role in many important biological functions. In our work we used Grammatical Inference (GI) to localize transmembrane segments. Our GI process is based specifically on the inference of Even Linear Languages. Results: We obtained values close to 80% in both specificity and sensitivity. Six datasets were used for the experiments, considering different encodings for the input sequences. An encoding that includes the topology changes in the sequence (from inside and outside the membrane to it and vice versa) gave the best results. This software is publicly available at: http://www.dsic.upv.es/users/tlcc/bio/bio.html Conclusion: We compared our results with other well-known methods, which obtain slightly better precision. However, this work shows that it is possible to apply Grammatical Inference techniques effectively to bioinformatics problems.

    Improved estimators of common variance of p-populations when Kurtosis is known

    Hessian, Kurtosis, Mean-squared error, Non-singular matrix, Pooled sample variance, Positive definite matrix, Relative efficiency

    A stochastic context free grammar based framework for analysis of protein sequences

    Background: In the last decade, there have been many applications of formal language theory in bioinformatics, such as RNA structure prediction and detection of patterns in DNA. In proteomics, however, the size of the protein alphabet and the complexity of the relationships between amino acids have mainly limited the application of formal language theory to the production of grammars whose expressive power is no higher than that of stochastic regular grammars. Like other state-of-the-art methods, these grammars cannot cover higher-order dependencies such as the nested and crossing relationships that are common in proteins. To overcome some of these limitations, we propose a Stochastic Context Free Grammar based framework for the analysis of protein sequences, where grammars are induced using a genetic algorithm. Results: This framework was implemented in a system aimed at producing binding site descriptors. These descriptors not only allow detection of protein regions that are involved in these sites, but also provide insight into their structure. Grammars were induced using quantitative properties of amino acids to deal with the size of the protein alphabet. Moreover, we imposed some structural constraints on grammars to reduce the extent of the rule search space. Finally, grammars based on different properties were combined to convey as much information as possible. Evaluation was performed on sites of various sizes and complexity described either by PROSITE patterns, domain profiles or a set of patterns. Results show that the produced binding site descriptors are human-readable and, hence, highlight biologically meaningful features. Moreover, they achieve good accuracy in both annotation and detection. In addition, findings suggest that, unlike current state-of-the-art methods, our system may be particularly suited to deal with patterns shared by non-homologous proteins. Conclusion: A new Stochastic Context Free Grammar based framework has been introduced, allowing the production of binding site descriptors for analysis of protein sequences. Experiments have shown that this new approach is not only valid but also produces human-readable descriptors for binding sites, which have been beyond the capability of current machine learning techniques.

    Systematic discovery of structural elements governing stability of mammalian messenger RNAs

    Decoding post-transcriptional regulatory programs in RNA is a critical step in the larger goal to develop predictive dynamical models of cellular behavior. Despite recent efforts [1–3], the vast landscape of RNA regulatory elements remains largely uncharacterized. A longstanding obstacle is the contribution of local RNA secondary structure in defining interaction partners in a variety of regulatory contexts, including but not limited to transcript stability [3], alternative splicing [4] and localization [3]. There are many documented instances where the presence of a structural regulatory element dictates alternative splicing patterns (e.g. human cardiac troponin T) or affects other aspects of RNA biology [5]. Thus, a full characterization of post-transcriptional regulatory programs requires capturing information provided by both local secondary structures and the underlying sequence [3,6]. We have developed a computational framework based on context-free grammars [3,7] and mutual information [2] that systematically explores the immense space of small structural elements and reveals motifs that are significantly informative of genome-wide measurements of RNA behavior. The application of this framework to genome-wide mammalian mRNA stability data revealed eight highly significant elements with substantial structural information, for th